Using machine learning algorithms to predict heart failure from medical records

1) Main objective of the analysis and machine learning task

The goal of this project is to apply machine learning techniques to electronic medical health records of patients with heart failure and to find the best model for this prediction task. This work also aims to replicate the paper published in February 2020 by Chicco & Jurman, which was mostly done in an R notebook.

The dataset used in this project is publicly available under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.

2) Brief description of the dataset and a summary of its attributes

Attribute Information:

Thirteen (13) clinical features:

  - age: age of the patient (years)
  - anaemia: decrease of red blood cells or hemoglobin (boolean)
  - creatinine_phosphokinase: level of the CPK enzyme in the blood (mcg/L)
  - diabetes: whether the patient has diabetes (boolean)
  - ejection_fraction: percentage of blood leaving the heart at each contraction
  - high_blood_pressure: whether the patient has hypertension (boolean)
  - platelets: platelets in the blood (kiloplatelets/mL)
  - serum_creatinine: level of serum creatinine in the blood (mg/dL)
  - serum_sodium: level of serum sodium in the blood (mEq/L)
  - sex: woman (0) or man (1)
  - smoking: whether the patient smokes (boolean)
  - time: follow-up period (days)
  - DEATH_EVENT: whether the patient died during the follow-up period (target)

Exploratory Data Analysis

The dataset has no missing values. However, the data type of 'age' is float64, so we will convert it to integer.
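A minimal sketch of this check and conversion in pandas. The tiny DataFrame below is a synthetic stand-in for the real CSV (the real data would be loaded with `pd.read_csv`); the column subset is illustrative only:

```python
import pandas as pd

# Synthetic stand-in for the real dataset (values are illustrative)
df = pd.DataFrame({"age": [75.0, 55.0, 65.0], "ejection_fraction": [20, 38, 20]})

# Confirm there are no missing values anywhere in the frame
assert df.isnull().sum().sum() == 0

# 'age' is stored as float64; cast it to integer
df["age"] = df["age"].astype(int)
```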

According to the dataset owner, the patients comprise 105 females and 194 males, encoded with label 0 for female and 1 for male.

Random Forest Feature Importance

eXtreme Gradient Boosting Feature Importance

Model training and selection

For the classification task, we will experiment with the Random Forest Classifier, eXtreme Gradient Boosting, Voting Classifier, K-Nearest Neighbors Classifier, Naïve Bayes Classifier and Decision Tree Classifier.

K Nearest Neighbor Classifier

eXtreme Gradient Boosting

Naïve Bayes Classifier

3) Brief summary of data exploration and actions taken for data cleaning and feature engineering

  1. We changed the age data type from float to integer.

  2. We converted the follow-up period in the 'time' feature from days to months.

  3. We created two new dataframes representing patients who died and patients who survived heart failure. From these dataframes, we conducted a series of analyses, including grouping survival into month ranges, in order to understand the discrepancies between the survivors and the deceased.

  4. We used the Random Forest Regressor, Random Forest Classifier, XGBoost and SHAP values to find important features. Based on the feature-importance results, we identified that the most common top features are serum_creatinine, age, ejection_fraction and creatinine_phosphokinase.

  5. We compared the selected features in order to identify discrepancies in prediction accuracy. In this case, we experimented with three different groups of features:

    a) serum_creatinine, ejection_fraction

    b) serum_creatinine, age, ejection_fraction

    c) serum_creatinine, age, ejection_fraction, creatinine_phosphokinase
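The feature-importance step above can be sketched with scikit-learn's impurity-based importances. The data below is synthetic (random values with a planted signal), so only the mechanics, not the rankings, reflect the real analysis:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the clinical features named above
rng = np.random.default_rng(0)
features = ["serum_creatinine", "age", "ejection_fraction", "creatinine_phosphokinase"]
X = pd.DataFrame(rng.normal(size=(299, 4)), columns=features)
# Plant a signal so two features dominate, mimicking the real finding
y = (X["serum_creatinine"] + 0.5 * X["ejection_fraction"] > 0).astype(int)

# Fit a Random Forest and rank features by impurity-based importance
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importance = pd.Series(clf.feature_importances_, index=features).sort_values(ascending=False)
print(importance)
```

The same `importance` ranking can then be used to build the three feature subsets compared in step 5.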

4) Summary of training at least three classification models (in terms of explainability and predictability)

  1. We utilized six types of classifiers.
  2. In our feature selection, we identified that serum creatinine and ejection fraction are the two most prominent features in the data. We also performed GridSearchCV to tune the hyperparameters. For the Voting Classifier, we ran two separate training procedures, one for the hard and one for the soft voting ensemble.
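A condensed sketch of the tuning and voting setup, on synthetic two-feature data (the grid values and the three estimators chosen for the ensemble are illustrative assumptions, not the exact configuration used in the project):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the two selected features
X, y = make_classification(n_samples=299, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Hyperparameter tuning with GridSearchCV (grid values are illustrative)
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    {"n_estimators": [50, 100], "max_depth": [3, None]}, cv=5)
grid.fit(X_train, y_train)

# Hard vs. soft voting ensembles over three of the classifiers
estimators = [("rf", grid.best_estimator_),
              ("knn", KNeighborsClassifier()),
              ("nb", GaussianNB())]
hard = VotingClassifier(estimators, voting="hard").fit(X_train, y_train)
soft = VotingClassifier(estimators, voting="soft").fit(X_train, y_train)
print(hard.score(X_test, y_test), soft.score(X_test, y_test))
```

Soft voting averages the predicted class probabilities, while hard voting takes the majority class label, which is why the two procedures are trained separately.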

5) Final model recommendation

The Voting Classifier with a soft voting ensemble performs best among all classifiers in the selected list, with an F1-score of 0.83; the Random Forest Classifier's accuracy score, by comparison, is 0.80. In medical use cases, recall is a much more important metric than accuracy, because maximizing the detection of actual positives ensures that no positive cases go undetected.

In this case, the recall for the Voting Classifier is 0.89 for non-death events and 0.76 for death events, compared to 0.89 and 0.68 respectively for the Random Forest Classifier.

As for the specificity measure, the Voting Classifier and the Random Forest Classifier achieve the same result of 0.89. Both classifiers also have the same sensitivity score of 0.68.
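For clarity, sensitivity and specificity can be read directly off the confusion matrix: sensitivity is the recall on the death class and specificity is the recall on the survived class. A small sketch with synthetic labels (the label vectors below are illustrative, not the project's actual predictions):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Illustrative labels: 0 = survived, 1 = death event (values are synthetic)
y_true = np.array([0, 0, 0, 1, 1, 1, 0, 1, 0, 1])
y_pred = np.array([0, 0, 1, 1, 0, 1, 0, 1, 0, 1])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)   # recall on the death class
specificity = tn / (tn + fp)   # recall on the survived class
assert sensitivity == recall_score(y_true, y_pred)
print(sensitivity, specificity)  # 0.8 0.8
```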

However, we would like to point out that the dataset is imbalanced, so we intend to conduct further investigation into this.

6) Summary Key Findings and Insights

  1. Models trained on three features (serum creatinine and ejection fraction, the two main contributing features, plus follow-up time) outperformed models trained on all features. This aligns with the findings of the research paper, namely that only a few features are needed. However, in the paper the authors argued that two features (serum creatinine and ejection fraction) are enough for the prediction task.
  2. We would argue that some caution is needed, as we found that using only those two features yielded poorer results: for example, with Random Forest, the specificity and sensitivity were worse (0.83 and 0.64 respectively).

7) Suggestions for next steps in analyzing this data